IN THE CLAIMS: 

Kindly replace the claims of record with the following full set of claims:

1. (Currently amended) An audio-visual system for processing video data
comprising: 

an object detection module capable of providing a plurality of object features 
from the video data, said object features selected from the group of: temporal and spatial
feature domains;

an audio processor module capable of providing a plurality of audio features from 
the video data, said audio features selected from the group consisting of: two or more of
the following: average energy, pitch, zero crossing, bandwidth, band central, roll off, low
ratio, spectral flux and 12 MFCC components;

a processor coupled to the object detection and the audio segmentation modules, 
arranged to determine a maximum correlation value among a plurality of correlation
values between the plurality of object features and the plurality of audio features, wherein
each of said correlation values is are determined as the sum of the correlation of the
selected elements in a subset of said audio features selected from the group consisting of
two or more of the following: average energy, pitch, zero crossing, bandwidth, band
central, roll off, low ratio, spectral flux and 12 MFCC components, and a selected one of
the object features.

2. (Original) The system of claim 1, wherein the processor is further arranged to 
determine whether an animated object in the video data is associated with audio. 

3. (Cancelled) 

4. (Original) The system of claim 2, wherein the animated object is a face and the
processor is arranged to determine whether the face is speaking. 



Amendment After Final Rejection 
Serial No. 10/076,194 



Docket No. US020037 



5. (Previously presented) The system of claim 4, wherein the plurality of object features
are eigenfaces that represent global features of the face. 

6. (Previously presented) The system of claim 1, further comprising:

a latent semantic indexing module coupled to the processor and that preprocesses 
the plurality of object features and the plurality of audio features before the correlation is 
performed. 

7. (Original) The system of claim 6, wherein the latent semantic indexing module 
includes a singular value decomposition module. 

8. (Currently amended) A method for identifying a speaking person within video data, the
method comprising the steps of:

receiving video data including image and audio information; 

determining a plurality of face image features from one or more faces in the video 
data, said image features selected from the group of temporal and spatial feature
domains;

determining a plurality of audio features related to audio information, said audio
features selected from the group consisting of: two or more of the following: average
energy, pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral flux and
12 MFCC components;

calculating correlation values between the plurality of face image features and the
audio features, wherein each of said correlation values is are determined as the sum of the
correlation values of the selected elements of in a subset of elements said audio features
selected from the group consisting of two or more of the following: average energy,
pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral flux and 12
MFCC components, and each of said selected face image features; and

determining the speaking person based on a maximum of the correlation values. 

9. (Previously presented) The method according to claim 8, further comprising the step
of: 

normalizing the face image features and the audio features. 

10. (Previously presented) The method according to claim 9, further comprising the step
of: 

performing a singular value decomposition on the normalized face image features 
and the audio features. 

11. (Original) The method according to claim 8, wherein the determining step includes 
determining the speaking person based upon the one or more faces that has the largest 
correlation. 

12. (Original) The method according to claim 10, wherein the calculating step includes 
forming a matrix of the face image features and the audio features. 

13. (Previously presented) The method according to claim 12, further comprising the step 
of: 

performing an optimal approximate fit using smaller matrices as compared to full 
rank matrices formed by the face image features and the audio features. 

14. (Original) The method according to claim 13, wherein the rank of the smaller 
matrices is chosen to remove noise and unrelated information from the full rank matrices. 
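
For illustration only, the calculating and determining steps recited in claims 8 and 11 may be sketched as follows. The sketch is not the applicant's implementation. It assumes that each audio feature and each face is reduced to a single per-frame time series, and it uses the Pearson coefficient as the correlation measure; the claims specify neither. The names `pearson`, `summed_correlation`, `pick_speaker`, `face_a` and `face_b`, and all sample values, are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def summed_correlation(audio_features, face_series):
    """Sum the correlations of every selected audio feature with one face feature."""
    return sum(pearson(track, face_series) for track in audio_features.values())


def pick_speaker(audio_features, face_features):
    """Return the face with the largest summed correlation, plus all scores."""
    scores = {
        face: summed_correlation(audio_features, series)
        for face, series in face_features.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical per-frame values: two audio features and one series per face.
audio = {
    "average_energy": [0.1, 0.9, 0.2, 0.8],
    "pitch": [100.0, 220.0, 110.0, 210.0],
}
faces = {
    "face_a": [0.0, 1.0, 0.1, 0.9],  # varies in step with the audio
    "face_b": [0.5, 0.5, 0.6, 0.4],  # largely unrelated to the audio
}

speaker, scores = pick_speaker(audio, faces)
print(speaker, scores)
```

In this sketch the face whose series tracks the audio most closely receives the largest summed score and is selected as the speaking person. The normalizing and singular value decomposition steps of claims 9, 10, 13 and 14 are omitted.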

15. (Currently amended) A memory medium including code for processing a video 
including images and audio, the code comprising: 

code to obtain a plurality of object features from the video, said object features
selected from the group of: temporal and spatial feature domains;

code to obtain a plurality of audio features from the video, said audio features
selected from the group consisting of: two or more of the following: average energy,
pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral flux and 12
MFCC components;

code to determine correlation values between the plurality of object features and 
the plurality of audio features, wherein each of said correlation values is are determined
as the sum of the correlation values of the selected elements of in a subset of elements
said audio features selected from the group consisting of two or more of the following:
average energy, pitch, zero crossing, bandwidth, band central, roll off, low ratio, spectral
flux and 12 MFCC components, and each of said selected object features; and

code to determine an association between one or more objects in the video and the 
audio based on a maximum of the correlation values. 

16. (Original) The memory medium of claim 15, wherein the one or more objects
comprises one or more faces. 

17. (Original) The memory medium of claim 16, further comprising code to determine a
speaking face. 

18. (Previously presented) The memory medium of claim 15, further comprising:

code to create a matrix using the plurality of object features and the audio features 
and code to perform a singular value decomposition on the matrix. 

19. (Previously presented) The memory medium of claim 18, further comprising:

code to perform an optimal approximate fit using smaller matrices as compared 
to full rank matrices formed by the object features and the audio features. 

20. (Original) The memory medium according to claim 19, wherein the rank of the
smaller matrices is chosen to remove noise and unrelated information from the full rank 
matrices. 